Actual use of regression models in clinical practice
depends on model simplicity. Reducing the
number of variables in a model contributes to this 
goal. The quality of a particular selection of vari- 
ables for a logistic regression model can be de- 
fined in terms of the number of variables selected 
and the model's discriminatory performance, as 
measured by the area under the ROC curve. A 
genetic algorithm was applied to search for the 
best variable combinations for modeling presence 
of myocardial infarction in a data set of patients 
with chest pain. Using an external validation
set, the resulting model was compared with models
constructed with standard backward, forward and
stepwise methods of variable selection. The
improvement in discriminatory ability yielded by the
genetic algorithm variable selection method was
statistically significant (p < 0.02).

INTRODUCTION 

Logistic regression models are common in the field 
of medicine. Several studies on diagnosis of coro- 
nary disease involving logistic regression models 
have been published [1, 2, 3]. Some of these models
were built to be used prospectively on previously
unseen cases. These are considered predictive
models. Predictive models can be compared
in terms of performance, robustness, explanatory 
power, and cost. Performance is often measured
by discriminatory ability (e.g., area under the
Receiver Operating Characteristic, or "ROC", curve)
and calibration (e.g., plots of expected versus ob- 
served results). Robustness can be interpreted as 
the ability to generalize the model to other data 
and to maintain good performance in presence of 
uncertainty and/or missing data items. Explanatory
power is the ability to explain certain dependencies
in the data and the model results. Cost
can be measured as an aggregate of associated 
costs of obtaining the information. 

A factor that contributes to performance, robust- 
ness, explanatory power and cost is parsimony of 
the model. A smaller model (in terms of the number
of variables) is likely to (i) avoid over-fitting
problems, thus performing and generalizing better,
(ii) be less likely to fail due to missing data,
(iii) be easier to explain, and (iv) cost less, both in
terms of data collection and computational effort. 

Traditionally, for logistic regression models, this 
issue has been addressed by stepwise forward, 
backward, and composite variable selection meth- 
ods [4]. The SAS statistical software system [5] 
calls these selection methods "forward", "back- 
ward" and "stepwise", respectively. These terms 
will be used in the remainder of this article.
Although well understood and relatively easy
to compute, these methods consider the addition
or removal of one variable at a time, conditional
on the variables already selected. This sequential
approach severely restricts the number of models
examined. Another approach is to examine all
possible models. Given n variables to choose from,
the number of possible models is 2^n, which renders
this exhaustive approach infeasible for all but
small numbers of variables. The SAS system
offers this possibility, but only for 10 or fewer
variables.
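
The growth of the exhaustive search space can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function name is an assumption and does not come from any statistical package.

```python
def count_models(n_variables: int) -> int:
    """Number of distinct variable subsets (candidate models) for n variables."""
    return 2 ** n_variables


# Compare a small problem with the 43-variable problem studied here.
for n in (10, 20, 43):
    print(f"{n} variables: {count_models(n)} candidate models")
```

With 10 variables there are 1024 candidate models, while 43 variables give 8796093022208.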

Heuristic approaches based on genetic algorithms 
have been used for selection of input variables 
and other parameters in artificial neural networks
[6, 7, 8]. A search through bibliographic
databases such as INSPEC, MEDLINE, MathSciNet,
Science Citation Index, HealthStar, and
Applied Science and Technology Index, together
with a search of multiple web search engines,
did not reveal any publications that deal with
genetic algorithm variable selection for logistic
regression. We have implemented a
genetic-algorithm-based variable selection method
for logistic regression models, and compared it to
traditional sequential variable selection methods
using data sets of patients with chest pain. In the
next section, we review the basic ideas behind
genetic algorithms and explain how we applied them
to variable selection.

METHODS 

A genetic algorithm is a heuristic for function op- 
timisation where the extrema of the function (i.e., 
minima or maxima) cannot be established analyt- 
ically. A population of potential solutions is re- 
fined iteratively by employing a strategy inspired 
by Darwinian evolution or natural selection.
Genetic algorithms promote "survival of the fittest".

Given an initial population, often created
randomly, the principal steps of a genetic algorithm
are: 

1. Select parents from the current population to 
undergo genetic operations to form offspring. 
This is done stochastically with preference
assigned to individuals that yield higher function
values (i.e., the "fittest" individuals).

2. Apply genetic operations such as crossover,
mutation and inversion to the selected par- 
ents to form offspring. The operators are de- 
signed such that properties of the parents are 
reproduced in the offspring. 

3. Recombine the offspring and current popula- 
tion to form a new population. 
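
Step 1 is commonly implemented as fitness-proportional selection. The following is a minimal sketch of one such scheme, stochastic universal sampling, which is the scheme named in the experiments later in this article. The function name and signature are illustrative assumptions, and all fitness values are assumed to be positive.

```python
import random


def stochastic_universal_sampling(fitness, k, rng=random):
    """Return k indices chosen with probability proportional to fitness.

    k equally spaced pointers are laid over the cumulative fitness,
    shifted by a single random offset. Assumes positive fitness values.
    """
    total = sum(fitness)
    step = total / k
    start = rng.uniform(0, step)
    chosen = []
    cumulative = 0.0  # total fitness of the individuals before index i
    i = 0
    for j in range(k):
        pointer = start + j * step
        # Advance to the individual whose fitness interval holds the pointer.
        while i < len(fitness) - 1 and cumulative + fitness[i] < pointer:
            cumulative += fitness[i]
            i += 1
        chosen.append(i)
    return chosen
```

Because only one random number is drawn, an individual holding a fraction q of the total fitness is selected either floor(q * k) or ceil(q * k) times, which gives less sampling noise than k independent roulette-wheel draws.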

These steps are performed until some predefined 
stopping criterion is met. The selection method 
from a population of potential solutions, with pref- 
erence to "fittest" individuals, has given these 
types of algorithms the name "genetic", or some- 
times "evolutionary", algorithms. The individuals 
in a population are often called "chromosomes", 
built out of "genes" that represent the properties 
of the individual, and the function to optimize is 
referred to as a "fitness" function. Each iteration 
is called a "generation". A cycle of this process 
is shown in Figure 1. A pseudo-code skeleton for
a genetic algorithm applying crossover, mutation
and inversion as genetic operators is shown in
Figure 2. For an in-depth explanation and discussion
of genetic algorithms, see [9, 10].

Figure 1: One cycle of evolution. Presented at the
left of the figure is a population of four manipulated
images of the human chromosome 1. Presented at the
right is a bit-vector representation of the "genes",
representing the properties of the image manipulations.
The two outer chromosomes are selected to be parents,
and crossover is applied (after the 4th bit). The
two middle chromosomes are then replaced by the
offspring to form the next generation population.
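
The crossover illustrated in Figure 1, together with mutation and inversion, can be sketched on bit vectors as follows. This is a minimal illustration under assumptions: the article does not define the operators in detail, so the usual textbook forms are used here (one-point crossover, independent bit flips, and reversal of a segment), and all function names are hypothetical.

```python
import random


def crossover(a, b, point):
    """One-point crossover: exchange the tails of two bit vectors after `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(bits, rate, rng=random):
    """Flip each bit independently with probability `rate`."""
    return [1 - x if rng.random() < rate else x for x in bits]


def invert(bits, i, j):
    """Inversion (assumed form): reverse the genes in positions i..j-1."""
    return bits[:i] + bits[i:j][::-1] + bits[j:]


# Crossover after the 4th bit, as in Figure 1.
parent_a = [1, 1, 1, 1, 0, 0, 0, 0]
parent_b = [0, 0, 0, 0, 1, 1, 1, 1]
child_a, child_b = crossover(parent_a, parent_b, 4)
print(child_a)
print(child_b)
```

Each child inherits its first four genes from one parent and the remaining genes from the other, so properties of both parents are reproduced in the offspring.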



P <- initializePopulation()
evaluate(P)
while (not stop(P)) do
    Parents[1..3] <- selectParents(P)
    Offspring[1] <- Crossover(Parents[1])
    Offspring[2] <- Mutation(Parents[2])
    Offspring[3] <- Inversion(Parents[3])
    P <- recombine(P, Offspring[1..3],
                   Parents[1..3])
    evaluate(P)
done

Figure 2: Pseudo-code for the genetic algorithm. P
denotes the population, Parents[] denotes a set of
selected individuals to undergo a genetic operation and
Offspring[] denotes the resulting set of individuals.
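
The loop in Figure 2 can be turned into a small runnable sketch. The version below is a simplified illustration, not the implementation used in this study: it selects parents by fitness-weighted random choice, produces only four offspring per generation, and recombines by keeping the fittest individuals, whereas the study used universal stochastic sampling for both selection and replacement. The function name, its defaults, and the assumption of non-negative fitness values are all illustrative.

```python
import random


def run_ga(fitness, n_genes, pop_size=20, patience=20, seed=0):
    """Evolve bit vectors of length n_genes (>= 2); return the fittest found.

    Stops when the average fitness of the population has not improved
    for `patience` consecutive generations.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best_avg, stall = float("-inf"), 0
    while stall < patience:
        # Select parents with preference for fitter individuals.
        weights = [fitness(ind) + 1e-9 for ind in pop]
        p1, p2, p3, p4 = rng.choices(pop, weights=weights, k=4)
        # Crossover: exchange tails after a random cut point.
        cut = rng.randrange(1, n_genes)
        offspring = [p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]]
        # Mutation: flip one randomly chosen bit.
        pos = rng.randrange(n_genes)
        offspring.append(p3[:pos] + [1 - p3[pos]] + p3[pos + 1:])
        # Inversion: reverse a randomly chosen segment.
        i, j = sorted(rng.sample(range(n_genes + 1), 2))
        offspring.append(p4[:i] + p4[i:j][::-1] + p4[j:])
        # Recombine: keep the pop_size fittest of population plus offspring.
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
        avg = sum(fitness(ind) for ind in pop) / pop_size
        if avg > best_avg:
            best_avg, stall = avg, 0
        else:
            stall += 1
    return pop[0]


# Toy problem: maximise the number of set bits.
best = run_ga(lambda bits: sum(bits), n_genes=8)
print(best)
```

In variable selection, the toy fitness would be replaced by a function that fits a model on the variables marked by the set bits and scores it.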



The objective of variable selection for logistic
regression models is to find parsimonious models
that perform as well as or better than the model that
utilizes all available information. With this
objective in mind, we construct a measure of fitness for
a selection of variables v. Given two tagged sets of
data, a training or construction set C, and a hold-out
or selection set S, a logistic regression model
m_C(v) can be constructed using C, and evaluated
using S. The result is a numeric value c_S(m_C(v))
representing the performance of m_C(v) on S. If
the total number of variables is u, and the number
of variables in the selection v is n, we propose
the following fitness function:

(1)  f(v, C, S) = c_S(m_C(v)) + ...

The first term rewards models with good perfor- 
mance, and the second term rewards parsimonious 
models. The parameter p determines the weight 
that is placed on such a reward. 
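
A fitness function of this two-term shape can be sketched as follows. The exact form of the parsimony term is not legible in the equation as reproduced here, so the term p * (u - n) / u used below is an assumption: it is one plausible form built from the symbols u, n and p defined above, not necessarily the one used in the study. The function and argument names are likewise illustrative.

```python
def fitness(c_index: float, n_selected: int, n_total: int, p: float) -> float:
    """Performance term plus a parsimony reward.

    c_index    -- performance of the model on the hold-out set (first term)
    n_selected -- number of variables in the selection (n)
    n_total    -- total number of available variables (u)
    p          -- weight placed on the parsimony reward

    ASSUMPTION: the parsimony term is taken to be p * (u - n) / u.
    """
    parsimony = (n_total - n_selected) / n_total
    return c_index + p * parsimony


# Two models with equal performance: the smaller one scores higher.
print(fitness(0.90, 5, 43, p=0.05))
print(fitness(0.90, 10, 43, p=0.05))
```

Under this assumed form, the parsimony reward ranges from 0 (all variables selected) to p (no variables selected), so p bounds how much discriminatory performance can be traded for a smaller model.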

The genetic algorithm is configured by parameters 
such as: the fraction of the population to undergo 
each genetic operation, the size of the population, 
the fitness function, and the stopping criteria. A
predefined number of the best encountered indi- 
viduals is returned as the result of one run of the 
algorithm. 

EXPERIMENTS 

The objective of our experiment was to compare 
the performance of a logistic regression model con- 
structed using the variable selection method based 
on the genetic algorithm with models constructed 
using standard forward, backward and stepwise 
variable selection. 

The models were constructed using a data set from 
Sheffield, England, of 500 patients with chest pain 
presenting at the emergency room (ER). The data 
set contained 43 predictor variables and one
outcome, indicating whether these patients had a
myocardial infarction (MI) or not. The prevalence of
MI was 30%.

For the application of the genetic algorithm, the 
set was randomly split into a training part C, 
and a hold-out part S. The parts had 335 and 
165 cases, respectively. The chromosomes were
represented as binary vectors, where a set bit
indicates the presence of the corresponding
variable in the logistic model. The "genetic"
operators crossover, mutation and inversion were
used, and selection was done by universal stochastic
sampling. This was also used in the selection
of individuals to replace in the fixed size
population in the recombination step. Initialization was
random, and the stopping criterion was lack of
improvement in the average fitness of the population
over 20 generations. The population size was set
to 70, and the probabilities for selection for crossover,
mutation and inversion were 0.3, 0.1 and 0.1,
respectively. Each "individual" (i.e., combination
of variables selected) was transformed into a
logistic regression model m_C(v) using the SAS system
LOGISTIC procedure. The coefficients were
calculated using the training set C. The performance
measure c_S(m_C(v)) was the area underneath
the receiver operating characteristic (ROC)
curve [11], computed as its equivalent statistic, the
c-index [12], on the hold-out set S. A value of
0.05 was empirically chosen for the parameter p in
the fitness function.
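
The c-index used as the performance measure can be computed directly from its definition: the proportion of (case with MI, case without MI) pairs in which the case with MI receives the higher predicted probability, with ties counted as one half. The sketch below is a plain illustration of that definition, not the SAS computation; the function name is an assumption.

```python
def c_index(labels, scores):
    """Concordance index (equivalent to the area under the ROC curve).

    labels -- 1 for a positive case (e.g. MI), 0 for a negative case
    scores -- predicted probabilities, in the same order as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                concordant += 1.0  # pair ranked correctly
            elif sp == sn:
                concordant += 0.5  # tie counts as half
    return concordant / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ranked correctly.
print(c_index([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The double loop is quadratic in the number of cases, which is acceptable for a hold-out set of 165 cases; rank-based formulas give the same value more efficiently for large sets.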

The genetic algorithm ran for 79 generations, re- 
quiring 1549 fitness function evaluations. The 
fittest model was selected as the result of the 
method, and labeled model "g". 

The logistic regression models with sequential 
variable selection were constructed using the SAS 
system LOGISTIC procedure on the entire set
with significance levels for entry and removal of
0.05. They are termed model "f", for forward
selection, model "b", for backward selection, and
model "s", for stepwise selection. Additionally, a
model "a" was constructed with all 43 available
variables.



RESULTS 

An overview of the variables selected by the
different models can be seen in Table 1. The variables
"gender", "right arm pain", "diaphoresis",
"previous angina", and "ST elevation" were selected by
all methods. Certain variables that were consis- 
tently selected by the sequential methods, such as 
"sharp pain", "episodic pain", "hypoperfusion", 
and "ST or T abnormality" were not selected by 
the genetic algorithm method. 

The final models were evaluated on an external 
validation set of 1253 cases collected in Edinburgh, 
Scotland. The resulting c-indices were statistically 
compared using the method of Hanley and
McNeil [13]. The results are in Table 2. Our model
"g" was significantly better (p < 0.02) than any 
of the other models evaluated on the external val- 
idation set. There were no statistically significant 
differences between the other models. The corre- 
sponding ROC curves for all the models are shown 
in Figure 3. 

DISCUSSION

Although the computational effort spent by the
genetic algorithm [...] variable combinations is
considerably larger than the effort spent on doing
sequential variable selection, [...] the SAS system
when using the "best" selection [...] with the
"best" selection option, do an exhaustive
search through all 1024 possible variable
combinations from a set of 10 variables. The number
of models that an exhaustive search of all possible
variable combinations from 43 variables requires is
2^43 (on the order of 10^13), a number much larger.

The sequential selection methods agreed to a high
degree on the variables to include. The genetic
algorithm method selected [...] selected by the
sequential methods, but also others. Curiously, it
selected variables such as "diabetes", "severe
chest pain", "ST elevation", "new [...]
[...] decreases the chances that the model is
over-fitting the data, and strongly suggests that
generalisation to other data is warranted. We plan to
test the presented method in other domains.

Another future area of investigation is the effect 
of changing parameters such as p in the genetic al- 
gorithm, and comparing the resulting models with 
models created with sequential selection using a 
wider range of entry/removal levels. 

CONCLUSION 

We have presented a genetic-algorithm-based
variable selection method for a logistic regression that
models the presence of myocardial infarction in 
a patient presenting at the ER with chest pain. 
The improvement of discriminatory performance 
achieved by this method was statistically signif- 
icant (p < 0.02) over models constructed using 
traditional variable selection methods.